De novo genome assembly of a high-protein soybean variety HJ117

Objectives: Soybean is an important feed and oil crop worldwide because of its high protein and oil content. China holds a collection of more than 43,000 soybean germplasm resources, which provides rich genetic diversity for soybean breeding. However, this rich genetic diversity also poses great challenges to the genetic improvement of soybean. This study reports the de novo genome assembly of HJ117, a soybean variety with a high protein content of 52.99%. These data will be a valuable resource for further research on soybean quality improvement and will aid in elucidating the regulatory mechanisms underlying soybean protein content.
Data description: We generated a contiguous reference genome of 1041.94 Mb for HJ117 using a combination of Illumina short reads (23.38 Gb) and PacBio long reads (25.58 Gb), with high-quality sequence coverage of approximately 22.44× and 24.55×, respectively. HJ117 was developed through backcross breeding, using Jidou 12 as the recurrent parent and Chamoshidou as the donor parent. The assembly was further assisted by 114.5 Gb of Hi-C data (109.9×), resulting in a contig N50 of 19.32 Mb and a scaffold N50 of 51.43 Mb. Notably, Core Eukaryotic Genes Mapping Approach (CEGMA) and Benchmarking Universal Single-Copy Orthologs (BUSCO) assessments indicated that most core eukaryotic genes (97.18%) and genes in the BUSCO dataset (99.4%) were identified, and 96.44% of the genomic sequences were anchored onto twenty pseudochromosomes.

content is influenced by complex factors such as genotype, environment, and genotype-environment interactions [3,4]. Because soy protein content is strongly negatively correlated with both oil content [4] and yield [5], it is quite difficult to increase soy protein content.
In the early stages of soybean breeding, farmers primarily relied on repeatedly selecting preferred seeds from cultivated populations [6]. Artificial hybridization technology was introduced later, and the first artificially hybridized cultivated soybean was released in North America during the 1940s [7]. With the progress of molecular biology, marker-assisted selection (MAS) has been employed to expedite the breeding process [8]. The publication of the first soybean reference genome (cultivar Williams 82) in 2010 [9] marked the beginning of the era of soybean functional genomics research [10,11]. Improvements in sequencing technologies have since significantly boosted the capacity to generate high-quality genome assemblies.

Data description
The Glycine max sample was collected from Shijiazhuang (37°6′25″N, 114°42′47″E). Genomic DNA and total RNA were isolated from leaf tissues. High-quality DNA was extracted using QIAGEN® Genomic kits. Three methods were used to quantify and check the extracted DNA: a NanoDrop 2000 Spectrophotometer (Thermo Fisher Scientific), agarose gel electrophoresis, and a Qubit Fluorometer (Invitrogen). After these checks, the DNA was purified using AMPure PB beads (PacBio 100-265-900), and the resulting high-quality genomic DNA (gDNA) was used for library construction. The size and concentration of the library fragments were assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Qualified libraries were evenly loaded on SMRT Cells and sequenced for 30 h on the Sequel II/IIe system (Pacific Biosciences, CA, USA).
For Hi-C library preparation, the DNA sample was first fixed with formaldehyde and then digested with the HindIII restriction enzyme. Next, the DNA ends were repaired and labeled with biotin. T4 DNA ligase was then used to ligate the interacting fragments into circular molecules. After ligation, proteinase K was added to reverse the crosslinks and digest the proteins bound to the ligated DNA fragments, yielding purified DNA. The purified DNA was then fragmented to sizes of 300 to 500 base pairs, and the biotin-labeled DNA fragments were isolated using Dynabeads® M-280 Streptavidin (Life Technologies). Finally, the Hi-C library was constructed and sequenced on the Illumina NovaSeq6000 platform using paired-end reads of 150 base pairs.
To ensure high-quality data, the raw polymerase reads were subjected to quality control using the PacBio SMRT-Analysis package (https://www.pacb.com). Polymerase reads of the following types were filtered out: (1) reads less than 50 bp in length, (2) reads with a quality value below 0.8, and (3) reads containing a self-ligated adaptor; adaptor sequences were also removed from the polymerase reads. SMRT Link 9.0 (parameters --min-passes = 3 --min-rq = 0.99) was then used to generate CCS reads for subsequent assembly.
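The three filtering criteria above can be illustrated with a minimal Python sketch. It is not the SMRT-Analysis implementation: the `PolymeraseRead` record, its field names, and the helper functions are hypothetical, and the 0.8 threshold is assumed to apply to a per-read score between 0 and 1.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 50   # criterion (1): minimum read length from the text
MIN_SCORE = 0.8      # criterion (2): minimum per-read score from the text


@dataclass
class PolymeraseRead:
    """Hypothetical record; field names are illustrative assumptions."""
    name: str
    length_bp: int
    score: float                 # assumed to lie between 0 and 1
    self_ligated_adaptor: bool   # True if an adaptor is ligated to itself


def passes_filters(read: PolymeraseRead) -> bool:
    """Return True if the read survives all three criteria."""
    if read.length_bp < MIN_LENGTH_BP:
        return False
    if read.score < MIN_SCORE:
        return False
    if read.self_ligated_adaptor:
        return False
    return True


def filter_reads(reads):
    """Keep only the reads that pass every filter, preserving order."""
    return [read for read in reads if passes_filters(read)]


example = [
    PolymeraseRead("too_short", 40, 0.95, False),
    PolymeraseRead("low_score", 12000, 0.75, False),
    PolymeraseRead("adaptor_dimer", 300, 0.90, True),
    PolymeraseRead("kept", 15000, 0.92, False),
]
kept_names = [read.name for read in filter_reads(example)]
```

In this toy input only the read named `kept` survives. Adaptor trimming and CCS generation are separate steps and are not modeled here.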
Hifiasm (https://github.com/chhylp123/hifiasm) was employed to assemble the HiFi reads into a preliminary genome assembly (primary contigs). To obtain a chromosome-level genome, we performed Hi-C-assisted assembly. The ~114.5 Gb of raw Hi-C reads (Data file 1 and Data file 2) underwent preliminary quality control with fastp [14], and the resulting clean reads were aligned to the primary contigs using HiCUP. Only valid read pairs were used for further analysis. ALLHiC was used for Hi-C-assisted scaffolding, and Juicebox was then used to fine-tune the ALLHiC clustering results. The final genome had a contig N50 of 19.32 Mb and a total contig length of 1041.94 Mb, as well as a scaffold N50 of 51.43 Mb and a total scaffold length of 1041.95 Mb (Data file 3 and Data file 4).
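For readers unfamiliar with the N50 statistic reported above, the following short Python sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name and the toy lengths are illustrative and are not taken from the HJ117 data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)   # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total = 20, half = 10; 6 + 5 = 11 reaches half, so N50 = 5.
toy_n50 = n50([2, 3, 4, 5, 6])
```

Applying the same function to the contig lengths and then to the scaffold lengths of an assembly gives the contig N50 and scaffold N50, respectively.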
To assess the quality of the assembly, a custom script was used to compute the number of scaffolds clustered into each chromosome, the chromosome sequence lengths, and the genome anchoring rate. The Hi-C anchoring rate was calculated from the number of sequences anchored to chromosomes and the number of sequences left unanchored. The chromosome-level genome was partitioned into equal-length bins of 500 kb. The number of Hi-C read pairs spanning any two bins was used as the intensity signal representing the interaction between those bins. Heatmaps (Data file 5) were generated from these signals. BUSCO (Benchmarking Universal Single-Copy Orthologs: http://busco.ezlab.org/) [18] was also applied to assess the quality of the genome. A set of 248 conserved genes present in six eukaryotes was used as the core gene library for the CEGMA [19] evaluation. The evaluation results revealed that the majority of core eukaryotic genes (97.18%) and genes in the BUSCO dataset (99.4%) were successfully identified (Data file 6).
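The binning step and the chromosome-placement percentage described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' script: it handles a single chromosome, takes read pairs as already-mapped coordinate tuples, and the function names are hypothetical.

```python
BIN_SIZE_BP = 500_000  # bin size stated in the text


def contact_matrix(chrom_length_bp, read_pairs, bin_size=BIN_SIZE_BP):
    """Count Hi-C read pairs linking each pair of bins on one chromosome.

    read_pairs is an iterable of (position_a, position_b) tuples in bp.
    The result is a symmetric list-of-lists matrix of pair counts.
    """
    n_bins = -(-chrom_length_bp // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos_a, pos_b in read_pairs:
        i, j = pos_a // bin_size, pos_b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the matrix symmetric
    return matrix


def anchored_percentage(anchored, unanchored):
    """Percentage of sequences (or bases) placed on chromosomes."""
    return 100.0 * anchored / (anchored + unanchored)


# Toy chromosome of 1.2 Mb (three bins) with three mapped read pairs.
toy_matrix = contact_matrix(
    1_200_000,
    [(100, 600_000), (100, 200), (1_100_000, 700_000)],
)
```

Plotting such a matrix as a heatmap gives the kind of interaction map used to check scaffolding; strong signal along the diagonal and weak signal between chromosomes is the expected pattern for a well-ordered assembly.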

Limitations
Soybean is considered to have undergone an allotetraploidy event [9] that has resulted in 75% of its genes being present in multiple copies [32]. Repetitive DNA made up ~54.4% of each genome [33]. In this study, 23.38 Gb of Illumina short reads (Data file 13) and 25.58 Gb of PacBio long reads (Data file 14) were obtained, providing approximately 22.44× and 24.55× sequence coverage, respectively. Although Hi-C sequencing yielded 114.5 Gb of data with a depth of 109.9×, the overall sequencing depth was relatively low, which may result in incomplete genomic information.
The contig N50 of the de novo assembled HJ117 genome is 19.32 Mb, and the scaffold N50 reaches 51.43 Mb, indicating that the assembly is on par with other soybean genome assemblies from the same period. However, gaps still exist in the genome. To achieve a more accurate assembly, optical mapping could be incorporated and the HiFi sequencing depth could be increased in later work. Alternatively, the HJ117 genome could be assembled to the telomere-to-telomere level using ONT ultra-long sequencing to obtain more comprehensive genomic information.
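The coverage figures quoted above follow from dividing the data volume by the genome size. The sketch below reproduces that arithmetic, assuming the 1041.94 Mb assembly size is the denominator; the text does not state which genome size was used, although the reported values are consistent with this choice. The function name is illustrative.

```python
ASSEMBLY_SIZE_MB = 1041.94  # assembly size reported in this study (assumed denominator)


def fold_coverage(data_gb, genome_mb=ASSEMBLY_SIZE_MB):
    """Sequencing depth as total bases divided by genome size."""
    return data_gb * 1000 / genome_mb  # convert Gb to Mb before dividing


illumina_depth = round(fold_coverage(23.38), 2)  # Illumina short reads
pacbio_depth = round(fold_coverage(25.58), 2)    # PacBio long reads
hic_depth = round(fold_coverage(114.5), 1)       # Hi-C reads
```

Under this assumption the three values come out to 22.44, 24.55, and 109.9, matching the reported coverage.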

Table 1
Overview of data files/data sets